Mapping Rules for Building a Tunisian Dialect Lexicon and Generating Corpora
Abstract
In Tunisia today, the Tunisian Dialect (TD) of Arabic is increasingly used in interviews, news, and debate programs in place of Modern Standard Arabic (MSA). This has given rise to a new kind of language: most speech is no longer in MSA alone but alternates between MSA and TD. The situation has significant negative consequences for Automatic Speech Recognition (ASR): because the spoken dialects are not officially written and have no standard orthography, it is very costly to obtain adequate annotated corpora for training language models and building vocabularies. There are neither parallel corpora involving TD and MSA nor dictionaries. In this paper, we describe a method for building a bilingual dictionary using explicit knowledge about the relation between TD and MSA. We also present an automatic process for creating Tunisian Dialect
Similar resources
Building bilingual lexicon to create Dialect Tunisian corpora and adapt language model
Since the Tunisian revolution, Tunisian Dialect (TD), used in daily life, has become progressively used and represented in interviews, news and debate programs instead of Modern Standard Arabic (MSA). This situation has important negative consequences for natural language processing (NLP): since the spoken dialects are not officially written and do not have a standard orthography, it is very costl...
Morphological Analysis of Tunisian Dialect
In this paper, we address the problem of the morphological analysis of an Arabic dialect. We propose a method to adapt an Arabic morphological analyzer for the Tunisian dialect (TD). In order to do that, we create a lexicon for the TD. The creation of the lexicon is done in two steps. The first step consists in adapting a Modern Standard Arabic (MSA) lexicon. We adapted a list of MSA derivation...
A Corpus and Phonetic Dictionary for Tunisian Arabic Speech Recognition
In this paper we describe an effort to create a corpus and phonetic dictionary for Tunisian Arabic Automatic Speech Recognition (ASR). The corpus, named TARIC (Tunisian Arabic Railway Interaction Corpus) has a collection of audio recordings and transcriptions from dialogues in the Tunisian Railway Transport Network. The phonetic (or pronunciation) dictionary is an important ASR component that s...
Collaboratively Constructed Linguistic Resources for Language Variants and their Exploitation in NLP Application - the case of Tunisian Arabic and the Social Media
Modern Standard Arabic (MSA) is the formal language in most Arab countries. Arabic Dialects (AD), the languages of daily life, differ from MSA, especially in social media communication. However, most Arabic social media texts have mixed forms and many variations, especially between MSA and AD. This paper aims to bridge the gap between MSA and AD by providing a framework for the translation of texts of soc...
The Effects of Factorizing Root and Pattern Mapping in Translating between Tunisian Arabic and Standard Arabic
The development of natural language processing tools for dialects faces a severe lack of resources. In cases of diglossia, as in Arabic, one variant, Modern Standard Arabic (MSA), has many resources that can be used to build natural language processing tools, whereas the other variants, the Arabic dialects, are resource-poor. Taking advantage of the closeness of MSA and its dialects, one w...